In-class Exercise 5: Logistic Regression for Water Points in Osun (Nigeria)

Overview

In this exercise, we will build an explanatory model of the water point status in the state of Osun - Nigeria.

The Data

Two files will be used in this exercise:

  • Osun.rds: the aspatial data of water point status and their attributes

  • Osun_wp_sf.rds: the gespatial data of the Osun state

Variables

Dependent variable: water point status (i.e. functional/non-functional)

Independent variables:

  • distance_to_primary_road

  • distance_to_secondary_road

  • distance_to_tertiary_road

  • distance_to_city

  • distance_to_town

  • water_point_population

  • local_population_1km

  • usage_capacity

  • is_urban

  • water_source_clean

Getting Started

Importing necessary packages

pacman::p_load(sf, tidyverse, funModeling, blorr,
               corrplot, ggpubr, spdep, GWmodel,
               tmap, skimr, caret, report)

Importing files

First, we import the geospatial data of Osun’s LGAs.

osun <- read_rds("data/Osun.rds")

Secondly, we import the aspatial data of water point status and attributes in Osun. The observations with `status

osun_wp_sf <- read_rds("data/Osun_wp_sf.rds")

EDA

Let’s check the status distribution of the water points using freq().

osun_wp_sf %>% 
  freq(input = "status")

  status frequency percentage cumulative_perc
1   TRUE      2642       55.5            55.5
2  FALSE      2118       44.5           100.0

Now we view the map of Osun.

tmap_mode("view")
tm_shape(osun) +
  tm_polygons(alpha = 0.4) +
  tm_shape(osun_wp_sf) +
  tm_dots(col = "status",
          alpha = 0.6) +
  tm_view(set.zoom.limits = c(9,12))
tmap_mode("plot")

We use skim() from skimr to view the distribution of the variables.

osun_wp_sf %>% 
  skim()
Data summary
Name Piped data
Number of rows 4760
Number of columns 75
_______________________
Column type frequency:
character 47
logical 5
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.00 5 44 0 2 0
report_date 0 1.00 22 22 0 42 0
status_id 0 1.00 2 7 0 3 0
water_source_clean 0 1.00 8 22 0 3 0
water_source_category 0 1.00 4 6 0 2 0
water_tech_clean 24 0.99 9 23 0 3 0
water_tech_category 24 0.99 9 15 0 2 0
facility_type 0 1.00 8 8 0 1 0
clean_country_name 0 1.00 7 7 0 1 0
clean_adm1 0 1.00 3 5 0 5 0
clean_adm2 0 1.00 3 14 0 35 0
clean_adm3 4760 0.00 NA NA 0 0 0
clean_adm4 4760 0.00 NA NA 0 0 0
installer 4760 0.00 NA NA 0 0 0
management_clean 1573 0.67 5 37 0 7 0
status_clean 0 1.00 9 32 0 7 0
pay 0 1.00 2 39 0 7 0
fecal_coliform_presence 4760 0.00 NA NA 0 0 0
subjective_quality 0 1.00 18 20 0 4 0
activity_id 4757 0.00 36 36 0 3 0
scheme_id 4760 0.00 NA NA 0 0 0
wpdx_id 0 1.00 12 12 0 4760 0
notes 0 1.00 2 96 0 3502 0
orig_lnk 4757 0.00 84 84 0 1 0
photo_lnk 41 0.99 84 84 0 4719 0
country_id 0 1.00 2 2 0 1 0
data_lnk 0 1.00 79 96 0 2 0
water_point_history 0 1.00 142 834 0 4750 0
clean_country_id 0 1.00 3 3 0 1 0
country_name 0 1.00 7 7 0 1 0
water_source 0 1.00 8 30 0 4 0
water_tech 0 1.00 5 37 0 20 0
adm2 0 1.00 3 14 0 33 0
adm3 4760 0.00 NA NA 0 0 0
management 1573 0.67 5 47 0 7 0
adm1 0 1.00 4 5 0 4 0
New Georeferenced Column 0 1.00 16 35 0 4760 0
lat_lon_deg 0 1.00 13 32 0 4760 0
public_data_source 0 1.00 84 102 0 2 0
converted 0 1.00 53 53 0 1 0
created_timestamp 0 1.00 22 22 0 2 0
updated_timestamp 0 1.00 22 22 0 2 0
Geometry 0 1.00 33 37 0 4760 0
ADM2_EN 0 1.00 3 14 0 30 0
ADM2_PCODE 0 1.00 8 8 0 30 0
ADM1_EN 0 1.00 4 4 0 1 0
ADM1_PCODE 0 1.00 5 5 0 1 0

Variable type: logical

skim_variable n_missing complete_rate mean count
rehab_year 4760 0 NaN :
rehabilitator 4760 0 NaN :
is_urban 0 1 0.39 FAL: 2884, TRU: 1876
latest_record 0 1 1.00 TRU: 4760
status 0 1 0.56 TRU: 2642, FAL: 2118

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 68550.48 10216.94 49601.00 66874.75 68244.50 69562.25 471319.00 ▇▁▁▁▁
lat_deg 0 1.00 7.68 0.22 7.06 7.51 7.71 7.88 8.06 ▁▂▇▇▇
lon_deg 0 1.00 4.54 0.21 4.08 4.36 4.56 4.71 5.06 ▃▆▇▇▂
install_year 1144 0.76 2008.63 6.04 1917.00 2006.00 2010.00 2013.00 2015.00 ▁▁▁▁▇
fecal_coliform_value 4760 0.00 NaN NA NA NA NA NA NA
distance_to_primary_road 0 1.00 5021.53 5648.34 0.01 719.36 2972.78 7314.73 26909.86 ▇▂▁▁▁
distance_to_secondary_road 0 1.00 3750.47 3938.63 0.15 460.90 2554.25 5791.94 19559.48 ▇▃▁▁▁
distance_to_tertiary_road 0 1.00 1259.28 1680.04 0.02 121.25 521.77 1834.42 10966.27 ▇▂▁▁▁
distance_to_city 0 1.00 16663.99 10960.82 53.05 7930.75 15030.41 24255.75 47934.34 ▇▇▆▃▁
distance_to_town 0 1.00 16726.59 12452.65 30.00 6876.92 12204.53 27739.46 44020.64 ▇▅▃▃▂
rehab_priority 2654 0.44 489.33 1658.81 0.00 7.00 91.50 376.25 29697.00 ▇▁▁▁▁
water_point_population 4 1.00 513.58 1458.92 0.00 14.00 119.00 433.25 29697.00 ▇▁▁▁▁
local_population_1km 4 1.00 2727.16 4189.46 0.00 176.00 1032.00 3717.00 36118.00 ▇▁▁▁▁
crucialness_score 798 0.83 0.26 0.28 0.00 0.07 0.15 0.35 1.00 ▇▃▁▁▁
pressure_score 798 0.83 1.46 4.16 0.00 0.12 0.41 1.24 93.69 ▇▁▁▁▁
usage_capacity 0 1.00 560.74 338.46 300.00 300.00 300.00 1000.00 1000.00 ▇▁▁▁▅
days_since_report 0 1.00 2692.69 41.92 1483.00 2688.00 2693.00 2700.00 4645.00 ▁▇▁▁▁
staleness_score 0 1.00 42.80 0.58 23.13 42.70 42.79 42.86 62.66 ▁▁▇▁▁
location_id 0 1.00 235865.49 6657.60 23741.00 230638.75 236199.50 240061.25 267454.00 ▁▁▁▁▇
cluster_size 0 1.00 1.05 0.25 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
lat_deg_original 4760 0.00 NaN NA NA NA NA NA NA
lon_deg_original 4760 0.00 NaN NA NA NA NA NA NA
count 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁

We can see that there are some variables with a lot of missing values. We will exclude them from our analysis.

osun_wp_sf_clean <- osun_wp_sf %>% 
  filter_at(vars(status,
                 distance_to_primary_road,
                 distance_to_secondary_road,
                 distance_to_tertiary_road,
                 distance_to_city,
                 distance_to_town,
                 water_point_population,
                 local_population_1km,
                 usage_capacity,
                 is_urban,
                 water_source_clean),
            all_vars(!is.na(.))) %>% 
  mutate(usage_capacity = as.factor(usage_capacity))

Correlation Analysis

osun_wp <- osun_wp_sf_clean %>% 
  select(c(7,35:39,42:43,46:47,57)) %>% 
  st_set_geometry(NULL)
cluster_vars.cor = cor(
  osun_wp[,2:7])
corrplot.mixed(cluster_vars.cor,
               lower = "ellipse",
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

Logistic Regression Model

model <- glm(status ~ distance_to_primary_road +
               distance_to_secondary_road +
               distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = osun_wp_sf_clean,
             family = binomial(link = "logit"))
blr_regress(model)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4744           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3887        0.1124      3.4588       5e-04 
        distance_to_primary_road            1      0.0000        0.0000     -0.7153      0.4744 
       distance_to_secondary_road           1      0.0000        0.0000     -0.5530      0.5802 
       distance_to_tertiary_road            1      1e-04         0.0000      4.6708      0.0000 
            distance_to_city                1      0.0000        0.0000     -4.7574      0.0000 
            distance_to_town                1      0.0000        0.0000     -4.9170      0.0000 
              is_urbanTRUE                  1     -0.2971        0.0819     -3.6294       3e-04 
           usage_capacity1000               1     -0.6230        0.0697     -8.9366      0.0000 
water_source_cleanProtected Shallow Well    1      0.5040        0.0857      5.8783      0.0000 
   water_source_cleanProtected Spring       1      1.2882        0.4388      2.9359      0.0033 
         water_point_population             1      -5e-04        0.0000    -11.3686      0.0000 
          local_population_1km              1      3e-04         0.0000     19.2953      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7347          Somers' D        0.4693   
% Discordant          0.2653          Gamma            0.4693   
% Tied                0.0000          Tau-a            0.2318   
Pairs                5585188          c                0.7347   
---------------------------------------------------------------

We use the code chunk below to view the confusion matrix of the model.

blr_confusion_matrix(model, cutoff = 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1301  738
         1   813 1904

                Accuracy : 0.6739 
     No Information Rate : 0.4445 

                   Kappa : 0.3373 

McNemars's Test P-Value  : 0.0602 

             Sensitivity : 0.7207 
             Specificity : 0.6154 
          Pos Pred Value : 0.7008 
          Neg Pred Value : 0.6381 
              Prevalence : 0.5555 
          Detection Rate : 0.4003 
    Detection Prevalence : 0.5713 
       Balanced Accuracy : 0.6680 
               Precision : 0.7008 
                  Recall : 0.7207 

        'Positive' Class : 1

Now we need to change our data table from simple feature data frame to spatial point data frame. We need to remove the variables with p-value greater than 0.05, which are distance_to_primary_road and distance_to_secondary_road. Those are the values that do not contribute to our model at the 95% confidence interval.

osun_wp_sp <- osun_wp_sf_clean %>% 
  select(c(status,
           distance_to_tertiary_road,
           distance_to_city,
           distance_to_town,
           water_point_population,
           local_population_1km,
           is_urban,
           usage_capacity,
           water_source_clean)) %>% 
  as_Spatial()
bw.fixed <- bw.ggwr(status ~
                      distance_to_tertiary_road +
                      distance_to_city +
                      distance_to_town +
                      water_point_population +
                      local_population_1km +
                      is_urban +
                      usage_capacity +
                      water_source_clean,
                    data = osun_wp_sp,
                    family = "binomial",
                    approach = "AIC",
                    kernel = "gaussian",
                    adaptive = FALSE,
                    longlat = FALSE)
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
 Iteration    Log-Likelihood:(With bandwidth:  95768.67 )
=========================
       0        -2890 
       1        -2837 
       2        -2830 
       3        -2829 
       4        -2829 
       5        -2829 
Fixed bandwidth: 95768.67 AICc value: 5681.18 
 Iteration    Log-Likelihood:(With bandwidth:  59200.13 )
=========================
       0        -2878 
       1        -2820 
       2        -2812 
       3        -2810 
       4        -2810 
       5        -2810 
Fixed bandwidth: 59200.13 AICc value: 5645.901 
 Iteration    Log-Likelihood:(With bandwidth:  36599.53 )
=========================
       0        -2854 
       1        -2790 
       2        -2777 
       3        -2774 
       4        -2774 
       5        -2774 
       6        -2774 
Fixed bandwidth: 36599.53 AICc value: 5585.354 
 Iteration    Log-Likelihood:(With bandwidth:  22631.59 )
=========================
       0        -2810 
       1        -2732 
       2        -2711 
       3        -2707 
       4        -2707 
       5        -2707 
       6        -2707 
Fixed bandwidth: 22631.59 AICc value: 5481.877 
 Iteration    Log-Likelihood:(With bandwidth:  13998.93 )
=========================
       0        -2732 
       1        -2635 
       2        -2604 
       3        -2597 
       4        -2596 
       5        -2596 
       6        -2596 
Fixed bandwidth: 13998.93 AICc value: 5333.718 
 Iteration    Log-Likelihood:(With bandwidth:  8663.649 )
=========================
       0        -2624 
       1        -2502 
       2        -2459 
       3        -2447 
       4        -2446 
       5        -2446 
       6        -2446 
       7        -2446 
Fixed bandwidth: 8663.649 AICc value: 5178.493 
 Iteration    Log-Likelihood:(With bandwidth:  5366.266 )
=========================
       0        -2478 
       1        -2319 
       2        -2250 
       3        -2225 
       4        -2219 
       5        -2219 
       6        -2220 
       7        -2220 
       8        -2220 
       9        -2220 
Fixed bandwidth: 5366.266 AICc value: 5022.016 
 Iteration    Log-Likelihood:(With bandwidth:  3328.371 )
=========================
       0        -2222 
       1        -2002 
       2        -1894 
       3        -1838 
       4        -1818 
       5        -1814 
       6        -1814 
Fixed bandwidth: 3328.371 AICc value: 4827.587 
 Iteration    Log-Likelihood:(With bandwidth:  2068.882 )
=========================
       0        -1837 
       1        -1528 
       2        -1357 
       3        -1261 
       4        -1222 
       5        -1222 
Fixed bandwidth: 2068.882 AICc value: 4772.046 
 Iteration    Log-Likelihood:(With bandwidth:  1290.476 )
=========================
       0        -1403 
       1        -1016 
       2       -807.3 
       3       -680.2 
       4       -680.2 
Fixed bandwidth: 1290.476 AICc value: 5809.72 
 Iteration    Log-Likelihood:(With bandwidth:  2549.964 )
=========================
       0        -2019 
       1        -1753 
       2        -1614 
       3        -1538 
       4        -1506 
       5        -1506 
Fixed bandwidth: 2549.964 AICc value: 4764.056 
 Iteration    Log-Likelihood:(With bandwidth:  2847.289 )
=========================
       0        -2108 
       1        -1862 
       2        -1736 
       3        -1670 
       4        -1644 
       5        -1644 
Fixed bandwidth: 2847.289 AICc value: 4791.834 
 Iteration    Log-Likelihood:(With bandwidth:  2366.207 )
=========================
       0        -1955 
       1        -1675 
       2        -1525 
       3        -1441 
       4        -1407 
       5        -1407 
Fixed bandwidth: 2366.207 AICc value: 4755.524 
 Iteration    Log-Likelihood:(With bandwidth:  2252.639 )
=========================
       0        -1913 
       1        -1623 
       2        -1465 
       3        -1376 
       4        -1341 
       5        -1341 
Fixed bandwidth: 2252.639 AICc value: 4759.188 
 Iteration    Log-Likelihood:(With bandwidth:  2436.396 )
=========================
       0        -1980 
       1        -1706 
       2        -1560 
       3        -1479 
       4        -1446 
       5        -1446 
Fixed bandwidth: 2436.396 AICc value: 4756.675 
 Iteration    Log-Likelihood:(With bandwidth:  2322.828 )
=========================
       0        -1940 
       1        -1656 
       2        -1503 
       3        -1417 
       4        -1382 
       5        -1382 
Fixed bandwidth: 2322.828 AICc value: 4756.471 
 Iteration    Log-Likelihood:(With bandwidth:  2393.017 )
=========================
       0        -1965 
       1        -1687 
       2        -1539 
       3        -1456 
       4        -1422 
       5        -1422 
Fixed bandwidth: 2393.017 AICc value: 4755.57 
 Iteration    Log-Likelihood:(With bandwidth:  2349.638 )
=========================
       0        -1949 
       1        -1668 
       2        -1517 
       3        -1432 
       4        -1398 
       5        -1398 
Fixed bandwidth: 2349.638 AICc value: 4755.753 
 Iteration    Log-Likelihood:(With bandwidth:  2376.448 )
=========================
       0        -1959 
       1        -1680 
       2        -1530 
       3        -1447 
       4        -1413 
       5        -1413 
Fixed bandwidth: 2376.448 AICc value: 4755.48 
 Iteration    Log-Likelihood:(With bandwidth:  2382.777 )
=========================
       0        -1961 
       1        -1683 
       2        -1534 
       3        -1450 
       4        -1416 
       5        -1416 
Fixed bandwidth: 2382.777 AICc value: 4755.491 
 Iteration    Log-Likelihood:(With bandwidth:  2372.536 )
=========================
       0        -1958 
       1        -1678 
       2        -1528 
       3        -1445 
       4        -1411 
       5        -1411 
Fixed bandwidth: 2372.536 AICc value: 4755.488 
 Iteration    Log-Likelihood:(With bandwidth:  2378.865 )
=========================
       0        -1960 
       1        -1681 
       2        -1532 
       3        -1448 
       4        -1414 
       5        -1414 
Fixed bandwidth: 2378.865 AICc value: 4755.481 
 Iteration    Log-Likelihood:(With bandwidth:  2374.954 )
=========================
       0        -1959 
       1        -1679 
       2        -1530 
       3        -1446 
       4        -1412 
       5        -1412 
Fixed bandwidth: 2374.954 AICc value: 4755.482 
 Iteration    Log-Likelihood:(With bandwidth:  2377.371 )
=========================
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 
Fixed bandwidth: 2377.371 AICc value: 4755.48 
 Iteration    Log-Likelihood:(With bandwidth:  2377.942 )
=========================
       0        -1960 
       1        -1680 
       2        -1531 
       3        -1448 
       4        -1414 
       5        -1414 
Fixed bandwidth: 2377.942 AICc value: 4755.48 
 Iteration    Log-Likelihood:(With bandwidth:  2377.018 )
=========================
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 
Fixed bandwidth: 2377.018 AICc value: 4755.48 
bw.fixed
[1] 2377.371

Geography Weighted Regression Model

Now we do the geography weighted regression (gwRM)

gwlr.fixed <- ggwr.basic(status ~
                           distance_to_tertiary_road +
                           distance_to_city +
                           distance_to_town +
                           water_point_population +
                           local_population_1km +
                           is_urban +
                           usage_capacity +
                           water_source_clean,
                         data = osun_wp_sp,
                         bw = bw.fixed,
                         family = "binomial",
                         kernel = "gaussian",
                         adaptive = FALSE,
                         longlat = FALSE)
 Iteration    Log-Likelihood
=========================
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 

To evaluate the performance of the gwRM, we need to convert it into data frame.

gwr.fixed <- as.data.frame(gwlr.fixed$SDF)

Next, we will label yhat values greater or equal to 0.5 into 1 and else 0. The result of the logic comparison operation will be saved into a field called most.

gwr.fixed <- gwr.fixed %>% 
  mutate(most = ifelse(
    gwr.fixed$yhat >= 0.5, T, F))
gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data = gwr.fixed$most, reference = gwr.fixed$y)
CM
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1833  268
     TRUE    281 2374
                                          
               Accuracy : 0.8846          
                 95% CI : (0.8751, 0.8935)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7661          
                                          
 Mcnemar's Test P-Value : 0.6085          
                                          
            Sensitivity : 0.8671          
            Specificity : 0.8986          
         Pos Pred Value : 0.8724          
         Neg Pred Value : 0.8942          
             Prevalence : 0.4445          
         Detection Rate : 0.3854          
   Detection Prevalence : 0.4418          
      Balanced Accuracy : 0.8828          
                                          
       'Positive' Class : FALSE           
                                          

When we compare CM with model, we can see that CM has better performance in terms of many parameters including accuracy, sensitivity, and specificity.

osun_wp_sf_selected <- osun_wp_sf_clean %>% 
  select(c(ADM2_EN, ADM2_PCODE,
           ADM1_EN, ADM1_PCODE, status))
gwr_sf.fixed <- cbind(osun_wp_sf_selected, gwr.fixed)

Let’s review our CM model in an interactive map.

tmap_mode("view")
prob_T <- tm_shape(osun) +
  tm_polygons(alpha = 0.1) +
  tm_shape(gwr_sf.fixed) +
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(8,14))
prob_T
tmap_mode("plot")